First of all, a new directory in the DNA High-Performance Computing Cluster was created, which is
/mnt/Timina/bioinfoII/amarin/RNAseq2/
This directory has the next organization:
|-data/ # Directory to download the raw data
|-data_trimmed/ # Directory to save the post-QC data
|-kallisto_out/ # Directory to save the kallisto outputs
|-QC/ # Directory to perform raw QC
|-QC_trimmed/ # Directory to perform QC on the trimmed data
|-scripts/ # Some scripts used to send jobs
The dataset choose to analysis has de GEO Accession GSE91061. Since downloading from the NCBI may be difficult and quite technical, it was choosen to use the European mirror, the European Nucleotide Archive from the EMBL-EBI. By browing on the GEO web page, we see that the data are from the BioProject PRJNA356761. We use that ID to search on ENA.
Due the fact that the are a lot of file needed to be downloaded, we use the web application SRA Explorer to get a script to get the bulk download.
We then select all
the files and click on “Add to collection”. After that, we see our saves
datasets and select the bash script for downloading FastQ files option.
We save the script with the name
downloadScript.sh, and
save it in the HPCC, in the //data/ directory.
Then, we send a job using the script. It lasted around 5 days.
Once the data is ready, we perform quality control. We just send a
job using fastqc with all the fastq files via a
wildcard:
fastqc /mnt/Timina/bioinfoII/amarin/RNAseq2/data/*fastq.gz -o /mnt/Timina/bioinfoII/amarin/RNAseq2/QC
Then, we used multiqc to visualize all the samples at once:
multiqc /mnt/Timina/bioinfoII/amarin/RNAseq2/QC -o /mnt/Timina/bioinfoII/amarin/RNAseq2/QC
This allowed us to see some irregularities. First, the first to nucleotides of each read had a smaller mean quality score, compared to the rest of the read.
Besides, in the heatmap, the base sequence sequence content seemed to be biased on the first base pairs of the read.
All the samples resembled more or less the next percentage of base content.
Nevertheless, MultiQC didn’t detect any adapter content. Based on that, we decided to trimm the first 12 nucleotides of each pair.
We send a job in the //data/ directory with a script
using the software Trimmomatic:
for i in *_1.fastq.gz;
do echo
trimmomatic PE -threads 40 -phred33 -trimlog trim.log \
$i "${i%_1.fastq.gz}_2.fastq.gz" \
../data_trimmed/"${i%_1.fastq.gz}_1_trimmed.fastq.gz" \
../data_trimmed/"${i%_1.fastq.gz}_1_unpaired.fastq.gz" \
../data_trimmed/"${i%_1.fastq.gz}_2_trimmed.fastq.gz" \
../data_trimmed/"${i%_1.fastq.gz}_2_unpaired.fastq.gz" \
HEADCROP:12
done
Here we just cut the first 12 nucleotides of each read and create a
trim.log log file by the use of 40 threads. We know that
the Phred score are based on ASCII 33 because previously we found a
# symbol on the reads.
This job took a couple of days.
Once the data was trimmed, we did one more time the quality control.
fastqc /mnt/Timina/bioinfoII/amarin/RNAseq2/data_trimmed/*fastq.gz \
-o /mnt/Timina/bioinfoII/amarin/RNAseq2/QC_trimmed
multiqc /mnt/Timina/bioinfoII/amarin/RNAseq2/QC_trimmed \
-o /mnt/Timina/bioinfoII/amarin/RNAseq2/QC_trimmed
This time, the quality of the first pair bases was better.
Besides, the “per base sequence content” was less biased:
Each sample was more homogeneous.
Besides, other indicators, such as the “per base N content” and “per sequence GC content” performed better.
Once the data was ready, we decided to generate count matrices using Kallisto. To perform this, we first need a reference transcriptome. We choose the latest version available at GENCODE, which is the release 43. The file was downloaded at the parent directory of this project:
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/gencode.v43.transcripts.fa.gz
Then, generated the kallisto index:
kallisto index -i /mnt/Timina/bioinfoII/amarin/RNAseq2/kallisto_out/Hs_ref.kidx \
/mnt/Timina/bioinfoII/amarin/RNAseq2/gencode.v43.transcripts.fa.gz
Then, we used a bash script to perform the quantification with all the samples.
for file in /mnt/Timina/bioinfoII/amarin/RNAseq2/data_trimmed/*_1_trimmed.fastq.gz
do
op=$(basename $file | cut -c-10 )
file2=$(echo $file | sed 's/_1_/_2_/')
mkdir /mnt/Timina/bioinfoII/amarin/RNAseq2/kallisto_out/${op}
kallisto quant -i /mnt/Timina/bioinfoII/amarin/RNAseq2/kallisto_out/Hs_ref.kidx \
-o /mnt/Timina/bioinfoII/amarin/RNAseq2/kallisto_out/${op} \
--threads 40 $file $file2
done
For the differential expression analysis we used DESeq2, a Bioconductor package running on R 4.3.0 using a PC. The next, are the libraries to use.
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(tximport)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.1.8
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks plotly::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(DESeq2)
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
##
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
##
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which.max, which.min
##
##
## Attaching package: 'S4Vectors'
##
## The following objects are masked from 'package:lubridate':
##
## second, second<-
##
## The following objects are masked from 'package:dplyr':
##
## first, rename
##
## The following object is masked from 'package:tidyr':
##
## expand
##
## The following object is masked from 'package:plotly':
##
## rename
##
## The following object is masked from 'package:utils':
##
## findMatches
##
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
##
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
##
## The following object is masked from 'package:lubridate':
##
## %within%
##
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
##
## The following object is masked from 'package:purrr':
##
## reduce
##
## The following object is masked from 'package:plotly':
##
## slice
##
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
##
## The following object is masked from 'package:dplyr':
##
## count
##
##
## Attaching package: 'MatrixGenerics'
##
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
##
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
##
## Attaching package: 'Biobase'
##
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
##
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
## Warning: replacing previous import 'S4Arrays::read_block' by
## 'DelayedArray::read_block' when loading 'SummarizedExperiment'
library(ggplot2)
library(ggrepel)
library(rhdf5)
library(GenomicFeatures)
## Loading required package: AnnotationDbi
##
## Attaching package: 'AnnotationDbi'
##
## The following object is masked from 'package:dplyr':
##
## select
##
## The following object is masked from 'package:plotly':
##
## select
library(dplyr)
library(biomaRt)
library(gprofiler2)
library(ComplexHeatmap)
## Loading required package: grid
## ========================================
## ComplexHeatmap version 2.16.0
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
##
## If you use it in published research, please cite either one:
## - Gu, Z. Complex Heatmap Visualization. iMeta 2022.
## - Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional
## genomic data. Bioinformatics 2016.
##
##
## The new InteractiveComplexHeatmap package can directly export static
## complex heatmaps into an interactive Shiny app with zero effort. Have a try!
##
## This message can be suppressed by:
## suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================
##
##
## Attaching package: 'ComplexHeatmap'
##
## The following object is masked from 'package:plotly':
##
## add_heatmap
library(genefilter)
##
## Attaching package: 'genefilter'
##
## The following object is masked from 'package:ComplexHeatmap':
##
## dist2
##
## The following objects are masked from 'package:MatrixGenerics':
##
## rowSds, rowVars
##
## The following objects are masked from 'package:matrixStats':
##
## rowSds, rowVars
##
## The following object is masked from 'package:readr':
##
## spec
library(clusterProfiler)
##
## clusterProfiler v4.8.1 For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
##
## If you use clusterProfiler in published research, please cite:
## T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
##
## Attaching package: 'clusterProfiler'
##
## The following object is masked from 'package:biomaRt':
##
## select
##
## The following object is masked from 'package:AnnotationDbi':
##
## select
##
## The following object is masked from 'package:IRanges':
##
## slice
##
## The following object is masked from 'package:S4Vectors':
##
## rename
##
## The following object is masked from 'package:purrr':
##
## simplify
##
## The following object is masked from 'package:stats':
##
## filter
library(AnnotationDbi)
library(org.Hs.eg.db)
##
#library(ggupset)
library(enrichplot)
First, we imported the kallisto quantification tables to a local computer from the DNA HPCC. Then, we created a tx2gene table, which maps annotated genes with transcripts. To do this, we used the Bioconductor package biomaRt.
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")
tx2gene = getBM(attributes=c('ensembl_transcript_id','ensembl_gene_id'),
mart = ensembl)
Next, we import the quantification table to the R environment. To do this, we use the Bioconductor package rhdf5
## Paths to h5 archives
files = file.path(list.dirs("./kallisto_data", recursive = FALSE),
"abundance.h5")
names(files) = str_extract(files, "SRR\\d+")
## Importation
txi.kallisto.tsv = tximport(files, type = "kallisto", tx2gene = tx2gene,
ignoreAfterBar = TRUE, ignoreTxVersion=TRUE)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109
## summarizing abundance
## summarizing counts
## summarizing length
head(txi.kallisto.tsv$counts)
## SRR5088813 SRR5088814 SRR5088815 SRR5088816 SRR5088817
## ENSG00000000003 1499.8897 1031.487 1312.8342 1808.0254 1579.2045
## ENSG00000000005 1.0000 0.000 0.0000 2.0000 0.0000
## ENSG00000000419 1861.0942 2168.997 1508.2203 1496.6707 2454.8492
## ENSG00000000457 963.2173 1055.715 641.8061 747.8741 556.1578
## ENSG00000000460 1011.8159 1148.560 1076.2147 1276.6267 515.2332
## ENSG00000000938 1112.0000 620.000 1809.0000 1215.0000 431.0000
## SRR5088818 SRR5088819 SRR5088820 SRR5088821 SRR5088822
## ENSG00000000003 1574.5759 5353.023 3867.1593 1036.6527 2788.7407
## ENSG00000000005 2.0000 0.000 4.0000 6.0000 0.0000
## ENSG00000000419 2549.8278 1708.095 2677.1232 2612.1233 1646.1536
## ENSG00000000457 524.0493 1587.810 833.4179 895.6544 739.3859
## ENSG00000000460 455.2848 1364.916 491.6654 946.9077 417.4794
## ENSG00000000938 992.0000 76.000 484.0000 1257.0000 661.0000
## SRR5088823 SRR5088824 SRR5088825 SRR5088826 SRR5088827
## ENSG00000000003 2868.4659 817.9287 294.2066 2170.6283 2477.1213
## ENSG00000000005 0.0000 187.0000 25.0000 3.0000 0.0000
## ENSG00000000419 1504.1887 1795.1807 1864.4394 13570.2126 3901.6615
## ENSG00000000457 931.6242 923.3614 1009.3774 1114.0828 698.9271
## ENSG00000000460 260.2280 346.5751 253.3507 716.0371 626.8997
## ENSG00000000938 483.0000 807.0000 3836.0000 175.0000 1183.0000
## SRR5088828 SRR5088829 SRR5088830 SRR5088831 SRR5088832
## ENSG00000000003 2961.8483 1959.4511 1420.5087 1617.4379 1480.6072
## ENSG00000000005 3.0000 0.0000 5.0000 15.0000 5.0000
## ENSG00000000419 3392.4707 3074.5196 2252.7432 2081.0145 1921.6448
## ENSG00000000457 912.6742 929.7829 748.5641 924.7083 679.3888
## ENSG00000000460 696.6745 635.6388 513.0464 578.0808 405.9423
## ENSG00000000938 618.0000 84.0000 176.0000 456.0000 498.0000
## SRR5088833 SRR5088834 SRR5088835 SRR5088836 SRR5088837
## ENSG00000000003 1367.503 540.4137 530.0340 981.7885 1221.375
## ENSG00000000005 0.000 0.0000 2.0000 3.0000 0.000
## ENSG00000000419 1967.910 2395.5939 2627.1136 1937.1176 1863.499
## ENSG00000000457 1006.676 1601.0491 1999.9430 879.7708 1003.785
## ENSG00000000460 1001.351 548.7528 715.2732 580.5069 613.107
## ENSG00000000938 461.000 3301.0000 3747.0000 886.0000 488.000
## SRR5088838 SRR5088839 SRR5088840 SRR5088841 SRR5088842
## ENSG00000000003 5246.5285 4371.1350 1386.6904 1667.9434 2370.8907
## ENSG00000000005 0.0000 0.0000 1.0000 0.0000 4.0000
## ENSG00000000419 2203.0075 1880.9251 1070.5892 1093.1452 4496.7546
## ENSG00000000457 920.7094 837.1885 611.6103 584.0139 574.4228
## ENSG00000000460 526.7809 448.6499 130.6015 190.1884 282.7219
## ENSG00000000938 169.0000 161.0000 43.0000 50.0000 574.0000
## SRR5088843 SRR5088844 SRR5088845 SRR5088846 SRR5088847
## ENSG00000000003 1773.1713 1737.9675 586.3445 917.7317 734.9708
## ENSG00000000005 9.0000 5.0000 0.0000 7.0000 18.0000
## ENSG00000000419 8293.1696 2034.9443 2283.5735 1823.0789 2212.8519
## ENSG00000000457 672.9638 1112.1147 665.1059 1094.2700 1346.8390
## ENSG00000000460 483.8449 268.2798 233.3754 611.2008 533.3934
## ENSG00000000938 455.0000 737.0000 570.0000 850.0000 1266.0000
## SRR5088848 SRR5088849 SRR5088850 SRR5088851 SRR5088852
## ENSG00000000003 6306.3849 5074.493 606.6871 866.3972 862.9225
## ENSG00000000005 34.0000 0.000 6.0000 18.0000 0.0000
## ENSG00000000419 3604.5185 3582.646 2216.1319 2180.0030 2046.2397
## ENSG00000000457 1556.8109 1279.719 633.8067 565.3440 1153.2206
## ENSG00000000460 892.5964 1244.861 314.9208 345.0006 594.8391
## ENSG00000000938 131.0000 106.000 48.0000 53.0000 1199.0000
## SRR5088853 SRR5088854 SRR5088855 SRR5088856 SRR5088857
## ENSG00000000003 2671.1411 2074.6614 3658.8181 2655.2742 2136.568
## ENSG00000000005 0.0000 0.0000 0.0000 0.0000 0.000
## ENSG00000000419 1734.9773 2122.9492 3652.9663 4202.0157 3091.590
## ENSG00000000457 917.4060 914.4429 941.8504 779.9289 1108.435
## ENSG00000000460 335.7866 365.9461 555.1274 483.9068 1349.219
## ENSG00000000938 246.0000 704.0000 151.0000 235.0000 457.000
## SRR5088858 SRR5088859 SRR5088860 SRR5088861 SRR5088862
## ENSG00000000003 1720.1813 2785.8307 295.9765 53.23744 3496.2914
## ENSG00000000005 0.0000 0.0000 0.0000 7.00000 307.0000
## ENSG00000000419 2762.7303 3231.6175 1099.6812 5575.89804 1813.6622
## ENSG00000000457 937.2847 920.3392 450.4721 749.86228 887.6986
## ENSG00000000460 1437.1010 1129.5838 223.7868 691.37841 398.9935
## ENSG00000000938 499.0000 143.0000 5332.0000 351.00000 1364.0000
## SRR5088864 SRR5088865 SRR5088866 SRR5088867 SRR5088870
## ENSG00000000003 3980.2444 1643.481 2584.2381 1452.240 1103.5252
## ENSG00000000005 0.0000 7.000 0.0000 0.000 6.0000
## ENSG00000000419 2347.0594 6531.794 1792.6321 1596.549 2176.1678
## ENSG00000000457 810.3796 1668.369 652.4703 1188.964 1130.4216
## ENSG00000000460 1294.1503 1855.273 630.7542 775.583 708.0209
## ENSG00000000938 14.0000 1654.000 1222.0000 470.000 744.0000
## SRR5088871 SRR5088872 SRR5088873 SRR5088874 SRR5088877
## ENSG00000000003 791.5707 507.0273 1194.8276 996.4261 2134.4051
## ENSG00000000005 15.0000 55.0000 1.0000 9.0000 4.0000
## ENSG00000000419 2004.4437 1322.5974 2261.5769 1841.4780 2490.3217
## ENSG00000000457 847.4552 532.1253 767.3336 673.1486 1002.7374
## ENSG00000000460 374.4083 135.0421 448.1326 333.8946 599.8539
## ENSG00000000938 187.0000 2330.0000 113.0000 161.0000 1105.0000
## SRR5088878 SRR5088879 SRR5088880 SRR5088882 SRR5088883
## ENSG00000000003 2213.230 1676.773 2196.0660 1786.568 1225.2293
## ENSG00000000005 0.000 260.000 0.0000 0.000 7.0000
## ENSG00000000419 2556.585 2974.618 4423.9343 2419.445 1679.1308
## ENSG00000000457 1090.094 1409.785 1472.7298 1269.951 900.6896
## ENSG00000000460 1428.240 524.393 807.6177 1119.651 787.5410
## ENSG00000000938 336.000 1190.000 915.0000 3033.000 1744.0000
## SRR5088885 SRR5088886 SRR5088887 SRR5088888 SRR5088889
## ENSG00000000003 2409.6886 2392.8711 6001.996 2483.610 1895.6341
## ENSG00000000005 98.0000 51.0000 5.000 0.000 44.0000
## ENSG00000000419 2047.6542 2265.3464 3124.770 1829.664 1609.2291
## ENSG00000000457 834.1540 897.3309 3008.593 1684.672 1387.6998
## ENSG00000000460 655.1501 816.8001 3229.875 1580.403 508.6059
## ENSG00000000938 357.0000 301.0000 253.000 861.000 43.0000
## SRR5088890 SRR5088891 SRR5088892 SRR5088893 SRR5088894
## ENSG00000000003 1833.9203 333.4795 376.6351 2.085262 2429.6335
## ENSG00000000005 13.0000 0.0000 8.0000 0.000000 0.0000
## ENSG00000000419 1904.2238 1964.2980 2191.3371 1168.466344 2297.8610
## ENSG00000000457 1322.3372 976.8201 1450.7139 310.014678 1070.1167
## ENSG00000000460 570.3542 636.9625 763.8163 61.429664 712.8822
## ENSG00000000938 41.0000 1248.0000 1927.0000 23942.000000 164.0000
## SRR5088895 SRR5088896 SRR5088897 SRR5088898 SRR5088899
## ENSG00000000003 2278.8957 3603.5690 3056.4739 1045.5592 1010.5119
## ENSG00000000005 1.0000 0.0000 1.0000 154.0000 0.0000
## ENSG00000000419 2068.4917 1153.9548 996.5107 1283.3244 1624.0418
## ENSG00000000457 916.8713 1081.8873 935.1838 683.9812 653.9129
## ENSG00000000460 657.7822 497.3166 377.2150 360.0222 560.0213
## ENSG00000000938 92.0000 384.0000 319.0000 1804.0000 6096.0000
## SRR5088900 SRR5088903 SRR5088904 SRR5088905 SRR5088906
## ENSG00000000003 1902.6976 2266.9754 578.9869 554.5478 538.7689
## ENSG00000000005 11.0000 4.0000 8.0000 12.0000 117.0000
## ENSG00000000419 1521.8528 3423.1362 1390.3848 1377.5183 2213.9150
## ENSG00000000457 1020.4122 1147.7186 568.0432 459.7878 538.8831
## ENSG00000000460 611.4256 731.3597 355.8647 243.3910 360.3783
## ENSG00000000938 412.0000 890.0000 5539.0000 6648.0000 6147.0000
## SRR5088907 SRR5088908 SRR5088909 SRR5088910 SRR5088911
## ENSG00000000003 1429.9228 1125.8278 5927.2194 452.8942 3135.619
## ENSG00000000005 6.0000 17.0000 0.0000 0.0000 2.000
## ENSG00000000419 2277.1880 2124.2340 1776.4331 1367.2127 1670.511
## ENSG00000000457 764.3815 720.4699 918.0422 773.0971 1974.626
## ENSG00000000460 737.4045 642.1742 1320.0837 242.6844 2198.253
## ENSG00000000938 1322.0000 1597.0000 743.0000 5387.0000 138.000
## SRR5088912 SRR5088913 SRR5088914 SRR5088915 SRR5088916
## ENSG00000000003 1165.7217 2273.3066 1903.7784 3313.8153 1987.0716
## ENSG00000000005 5.0000 132.0000 8.0000 0.0000 35.0000
## ENSG00000000419 1758.3253 1741.7684 1679.5253 2000.8187 2285.5451
## ENSG00000000457 836.2455 905.8524 900.0133 917.5498 998.4041
## ENSG00000000460 636.3842 219.3707 693.5904 778.4120 693.2936
## ENSG00000000938 819.0000 311.0000 368.0000 469.0000 975.0000
## SRR5088917 SRR5088918 SRR5088919 SRR5088920 SRR5088921
## ENSG00000000003 1958.6521 2816.8639 2825.0211 388.6590 820.3042
## ENSG00000000005 9.0000 18.0000 0.0000 5.0000 10.0000
## ENSG00000000419 1869.2832 2442.3880 3731.7272 1763.7044 1862.8944
## ENSG00000000457 1176.0046 1283.6872 574.4145 1214.9047 1164.8337
## ENSG00000000460 681.3052 679.8709 517.7314 622.6102 548.7739
## ENSG00000000938 619.0000 506.0000 67.0000 91.0000 55.0000
## SRR5088922 SRR5088923 SRR5088924 SRR5088925 SRR5088926
## ENSG00000000003 2232.9844 1757.3521 5112.921 1396.4045 2290.002
## ENSG00000000005 14.0000 53.0000 7064.000 130.0000 18.000
## ENSG00000000419 2172.8916 3236.8590 2611.180 1147.5355 1901.407
## ENSG00000000457 830.8325 770.4955 1333.935 449.7312 1691.924
## ENSG00000000460 823.1905 791.6204 1470.649 189.0400 1543.972
## ENSG00000000938 166.0000 756.0000 520.000 430.0000 374.000
## SRR5088927 SRR5088928 SRR5088929 SRR5088930
## ENSG00000000003 1976.776 452.9074 599.5753 1529.0688
## ENSG00000000005 18.000 8.0000 9.0000 0.0000
## ENSG00000000419 1825.588 1152.3027 978.7727 1482.7561
## ENSG00000000457 1423.880 584.0388 562.9419 617.0739
## ENSG00000000460 1130.280 263.8671 257.5823 648.2131
## ENSG00000000938 442.000 5434.0000 3327.0000 1344.0000
Next, we create a metadata table to know the treatment for each SRA sample. Since it wasn’t available on SRA Run Selector web page, we needed to manually create it.
## Metadata importation to R
meta = read.csv("./metadata.csv", header = FALSE)
samples = column_to_rownames(meta, var = "V1")
Now, we create a DESeq2 dataset-like object, called DESeqDataSet, using the imported counts, sample names and metadata table.
dds = DESeqDataSetFromTximport(txi.kallisto.tsv,
colData = samples,
design = ~V2)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
## using counts and average transcript lengths from tximport
Before doing some exploration analysis, we pre-filter low counts.
## Pre-filtering
keep = rowSums(counts(dds)) >= 10
dds = dds[keep, ]
## Factor level
dds$V2 = factor(dds$V2, levels = c("Pre", "On"))
To identify if there’s a batch effect, we graph a PCA; normalizing the data with the variance stabilizing transformation.
vst = vst(dds, blind = FALSE)
## using 'avgTxLength' from assays(dds), correcting for library size
Once we hve the normalized data, we graph a PCA.
pca = prcomp(t(assay(vst)), rank. = 3)
components = pca[["x"]]
components = data.frame(components)
var = summary(pca)[["importance"]]['Proportion of Variance',]
var = 100*sum(var)
fig = plot_ly(components, x = ~PC1, y = ~PC2, z = ~PC3, color = dds$V2)
fig = fig %>% layout(title = "PCA")
fig
## No trace type specified:
## Based on info supplied, a 'scatter3d' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
## No scatter3d mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
As we can observe, there is not a visible cluster suggesting a batch effect. In fact, by some reason, almos all the samples cluster into a big cloud, regardless to their treatment.
Once we know there are not batch effects in our data, we perform the
Differential
Expression Analysis.
dds = DESeq(dds)
## estimating size factors
## using 'avgTxLength' from assays(dds), correcting for library size
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 8421 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
To consider significant genes, we will use a significance level of \(\alpha=0.01\), and a log fold change expression greater than 1 or lower than -1.
res = results(dds, alpha = 0.01)
## Significative genes
resSig = subset(res, padj < 0.01)
dim(resSig)
## [1] 80 6
We have 80 significant genes. We can visualize them in a volcano plot.
res.df = as.data.frame(res)
## Add genes symbols
res.df$symbols = mapIds(org.Hs.eg.db, keys = rownames(res.df),
keytype = "ENSEMBL", column = "SYMBOL")
### Up and down genes
res.df$diffexpressed = "NO"
res.df$diffexpressed[res.df$log2FoldChange > 1 & res.df$padj < 0.01] <- "UP"
res.df$diffexpressed[res.df$log2FoldChange < -1 & res.df$padj < 0.01] <- "DOWN"
ggplot(data = res.df, aes(x=log2FoldChange, y=-log10(padj), col=diffexpressed, label=symbols)) +
geom_point()
## Warning: Removed 19353 rows containing missing values (`geom_point()`).
To make the GO terms enrichment, we use the org.Hs.eg.db
database, containing the annotated genes from Homo sapiens. The
ontology was made using the Biologicla Process (BP).
genes = rownames(resSig[resSig$log2FoldChange >= 1, ])
GO = enrichGO(gene = genes,
OrgDb = org.Hs.eg.db,
keyType = "ENSEMBL",
ont = "BP")
GO.df = as.data.frame(GO)
head(GO.df)
## ID Description GeneRatio BgRatio
## GO:0006953 GO:0006953 acute-phase response 7/60 58/20772
## GO:0042178 GO:0042178 xenobiotic catabolic process 5/60 30/20772
## GO:0006721 GO:0006721 terpenoid metabolic process 7/60 119/20772
## GO:0008202 GO:0008202 steroid metabolic process 10/60 363/20772
## GO:0002526 GO:0002526 acute inflammatory response 7/60 131/20772
## GO:0006805 GO:0006805 xenobiotic metabolic process 7/60 131/20772
## pvalue p.adjust qvalue
## GO:0006953 3.132619e-10 4.388799e-07 3.271114e-07
## GO:0042178 2.286453e-08 1.601660e-05 1.193769e-05
## GO:0006721 5.091813e-08 2.305916e-05 1.718674e-05
## GO:0008202 8.142519e-08 2.305916e-05 1.718674e-05
## GO:0002526 9.875443e-08 2.305916e-05 1.718674e-05
## GO:0006805 9.875443e-08 2.305916e-05 1.718674e-05
## geneID
## GO:0006953 ENSG00000055955/ENSG00000132693/ENSG00000197249/ENSG00000228278/ENSG00000229314/ENSG00000241635/ENSG00000257017
## GO:0042178 ENSG00000140505/ENSG00000160868/ENSG00000165841/ENSG00000241635/ENSG00000255974
## GO:0006721 ENSG00000025423/ENSG00000140505/ENSG00000160868/ENSG00000165841/ENSG00000241635/ENSG00000244122/ENSG00000248144
## GO:0008202 ENSG00000025423/ENSG00000073734/ENSG00000140505/ENSG00000160868/ENSG00000165841/ENSG00000175336/ENSG00000180432/ENSG00000241635/ENSG00000244122/ENSG00000255974
## GO:0002526 ENSG00000055955/ENSG00000132693/ENSG00000197249/ENSG00000228278/ENSG00000229314/ENSG00000241635/ENSG00000257017
## GO:0006805 ENSG00000073734/ENSG00000140505/ENSG00000160868/ENSG00000165841/ENSG00000241635/ENSG00000244122/ENSG00000255974
## Count
## GO:0006953 7
## GO:0042178 5
## GO:0006721 7
## GO:0008202 10
## GO:0002526 7
## GO:0006805 7
sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Pop!_OS 22.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/Mexico_City
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] enrichplot_1.20.0 org.Hs.eg.db_3.17.0
## [3] clusterProfiler_4.8.1 genefilter_1.82.1
## [5] ComplexHeatmap_2.16.0 gprofiler2_0.2.1
## [7] biomaRt_2.56.0 GenomicFeatures_1.52.0
## [9] AnnotationDbi_1.62.1 rhdf5_2.44.0
## [11] ggrepel_0.9.3 DESeq2_1.40.1
## [13] SummarizedExperiment_1.30.1 Biobase_2.60.0
## [15] MatrixGenerics_1.12.0 matrixStats_0.63.0
## [17] GenomicRanges_1.52.0 GenomeInfoDb_1.36.0
## [19] IRanges_2.34.0 S4Vectors_0.38.1
## [21] BiocGenerics_0.46.0 lubridate_1.9.2
## [23] forcats_1.0.0 stringr_1.5.0
## [25] dplyr_1.1.0 purrr_1.0.1
## [27] readr_2.1.4 tidyr_1.3.0
## [29] tibble_3.1.8 tidyverse_2.0.0
## [31] tximport_1.28.0 plotly_4.10.1
## [33] ggplot2_3.4.1
##
## loaded via a namespace (and not attached):
## [1] splines_4.3.0 BiocIO_1.10.0 bitops_1.0-7
## [4] ggplotify_0.1.0 filelock_1.0.2 polyclip_1.10-4
## [7] XML_3.99-0.14 lifecycle_1.0.3 doParallel_1.0.17
## [10] lattice_0.21-8 MASS_7.3-59 crosstalk_1.2.0
## [13] magrittr_2.0.3 sass_0.4.5 rmarkdown_2.20
## [16] jquerylib_0.1.4 yaml_2.3.7 cowplot_1.1.1
## [19] DBI_1.1.3 RColorBrewer_1.1-3 zlibbioc_1.46.0
## [22] ggraph_2.1.0 RCurl_1.98-1.12 yulab.utils_0.0.6
## [25] tweenr_2.0.2 rappdirs_0.3.3 circlize_0.4.15
## [28] GenomeInfoDbData_1.2.10 tidytree_0.4.2 annotate_1.78.0
## [31] codetools_0.2-19 DelayedArray_0.26.2 DOSE_3.26.1
## [34] xml2_1.3.3 ggforce_0.4.1 tidyselect_1.2.0
## [37] shape_1.4.6 aplot_0.1.10 farver_2.1.1
## [40] viridis_0.6.3 BiocFileCache_2.8.0 GenomicAlignments_1.36.0
## [43] jsonlite_1.8.4 GetoptLong_1.0.5 ellipsis_0.3.2
## [46] tidygraph_1.2.3 survival_3.5-5 iterators_1.0.14
## [49] foreach_1.5.2 tools_4.3.0 progress_1.2.2
## [52] treeio_1.24.0 Rcpp_1.0.10 glue_1.6.2
## [55] gridExtra_2.3 xfun_0.39 qvalue_2.32.0
## [58] withr_2.5.0 fastmap_1.1.0 rhdf5filters_1.12.1
## [61] fansi_1.0.4 digest_0.6.31 timechange_0.2.0
## [64] R6_2.5.1 gridGraphics_0.5-1 colorspace_2.1-0
## [67] GO.db_3.17.0 RSQLite_2.3.1 utf8_1.2.3
## [70] generics_0.1.3 data.table_1.14.8 rtracklayer_1.60.0
## [73] prettyunits_1.1.1 graphlayouts_1.0.0 httr_1.4.4
## [76] htmlwidgets_1.6.2 S4Arrays_1.0.4 scatterpie_0.1.9
## [79] pkgconfig_2.0.3 gtable_0.3.1 blob_1.2.3
## [82] XVector_0.40.0 shadowtext_0.1.2 htmltools_0.5.4
## [85] fgsea_1.26.0 clue_0.3-64 scales_1.2.1
## [88] png_0.1-8 ggfun_0.0.9 knitr_1.42
## [91] rstudioapi_0.14 tzdb_0.3.0 reshape2_1.4.4
## [94] rjson_0.2.21 nlme_3.1-162 curl_5.0.0
## [97] cachem_1.0.6 GlobalOptions_0.1.2 parallel_4.3.0
## [100] HDO.db_0.99.1 restfulr_0.0.15 pillar_1.8.1
## [103] vctrs_0.5.2 dbplyr_2.3.0 xtable_1.8-4
## [106] cluster_2.1.4 evaluate_0.20 cli_3.6.0
## [109] locfit_1.5-9.7 compiler_4.3.0 Rsamtools_2.16.0
## [112] rlang_1.0.6 crayon_1.5.2 labeling_0.4.2
## [115] plyr_1.8.8 stringi_1.7.12 viridisLite_0.4.1
## [118] BiocParallel_1.34.1 assertthat_0.2.1 munsell_0.5.0
## [121] Biostrings_2.68.1 lazyeval_0.2.2 GOSemSim_2.26.0
## [124] Matrix_1.5-1 patchwork_1.1.2 hms_1.1.2
## [127] bit64_4.0.5 Rhdf5lib_1.22.0 KEGGREST_1.40.0
## [130] highr_0.10 igraph_1.4.3 memoise_2.0.1
## [133] bslib_0.4.2 ggtree_3.8.0 fastmatch_1.1-3
## [136] bit_4.0.5 downloader_0.4 gson_0.1.0
## [139] ape_5.7-1